Bayesian Attention Modules
Attention modules, as simple and effective tools, have not only enabled deep neural networks to achieve state-of-the-art results in many domains, but also enhanced their interpretability. Most current models use deterministic attention modules due to their simplicity and ease of optimization. Stochastic counterparts, on the other hand, are less popular despite their potential benefits. The main reason is that stochastic attention often introduces optimization issues or requires significant model changes. In this paper, we propose a scalable stochastic version of attention that is easy to implement and optimize. We construct simplex-constrained attention distributions by normalizing reparameterizable distributions, making the training process differentiable. We learn their parameters in a Bayesian framework where a data-dependent prior is introduced for regularization. We apply the proposed stochastic attention modules to various attention-based models, with applications to graph node classification, visual question answering, image captioning, machine translation, and language understanding. Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
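The construction described above, drawing from a reparameterizable distribution and normalizing onto the simplex, can be sketched as follows. This is a minimal illustration, not the paper's exact parameterization: the function name, the log-normal choice, and the fixed noise scale `sigma` are all assumptions.

```python
import math
import random

def stochastic_attention(scores, sigma=1.0, rng=random):
    """Simplex-constrained stochastic attention: normalize reparameterizable
    log-normal draws whose log-means are the deterministic attention scores.
    (A minimal sketch; the log-normal choice and fixed sigma are illustrative.)"""
    # Reparameterization: w_i = exp(score_i + sigma * eps_i), eps_i ~ N(0, 1),
    # so the sampled weights are a differentiable function of the scores.
    weights = [math.exp(s + sigma * rng.gauss(0.0, 1.0)) for s in scores]
    total = sum(weights)
    # Normalizing positive weights places the vector on the probability simplex.
    return [w / total for w in weights]

attn = stochastic_attention([0.5, 1.2, -0.3])  # one stochastic attention draw
```

Because the draw is an explicit function of the scores and standard Gaussian noise, gradients flow through the scores during training, which is what makes the procedure differentiable.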
Bayesian Attention Modules: Appendix A (Algorithm 1: Bayesian Attention Modules)
We follow the same architectural hyperparameters as in Veličković et al. We adopt hypothesis testing to quantify the uncertainty of a model's prediction. Acc(ans) = min{(# humans that said ans) / 3, 1}. By stacking MCA layers, MCAN enables deep interactions between the question and image features. We conduct experiments on an attention-based model for image captioning, Att2in, in Rennie et al.
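The VQA accuracy rule quoted above translates directly into code. Function and argument names here are illustrative; the rule itself, Acc(ans) = min{(# humans that said ans) / 3, 1}, is taken from the text.

```python
def vqa_accuracy(answer, human_answers):
    """VQA accuracy: an answer is fully correct (1.0) if at least three
    human annotators gave it; otherwise it earns partial credit in
    increments of 1/3 per agreeing annotator."""
    return min(human_answers.count(answer) / 3.0, 1.0)

humans = ["cat", "cat", "dog", "cat", "bird"]
full = vqa_accuracy("cat", humans)     # three annotators agree -> 1.0
partial = vqa_accuracy("dog", humans)  # one annotator agrees -> 1/3
```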
Review for NeurIPS paper: Bayesian Attention Modules
The improvement over deterministic attention seems marginal on some tasks, such as VQA and machine translation. Unlike prior work on discrete latent variables, which can make interpretability claims, I'm not sure why we want to model soft attention as a latent variable. Is the stochastic version better in low-resource scenarios? Also, can you further quantify the benefits of modeling attention uncertainty, or qualitatively show a few samples from the attention distribution and check whether they truly reflect the underlying uncertainty?
Review for NeurIPS paper: Bayesian Attention Modules
This paper proposes treating attention weights as continuous latent variables, trained in a variational autoencoder framework. It uses reparameterizable distributions, such as the Weibull and log-normal distributions, to obtain unnormalized weights, which are then normalized. Experiments show that the proposed continuous latent attention mechanism outperforms deterministic attention on a wide variety of tasks, including image captioning, machine translation, graph node classification, and fine-tuning BERT. All reviewers recommended acceptance, pointing out that this is an interesting idea and a solid, well-executed work. One concern was raised about the significance of the improvements on VQA and NMT, and another about directly setting the prior to the approximate posterior; the authors addressed both in the rebuttal.
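As a concrete instance of the reparameterizable distributions the meta-review names, a Weibull draw can be expressed through its inverse CDF. This is a hedged sketch under standard textbook facts about the Weibull distribution; the function name and the shape/scale values are illustrative, not from the paper.

```python
import math
import random

def weibull_rsample(k, lam, rng=random):
    """Reparameterized Weibull sample via the inverse CDF.
    With u ~ Uniform(0, 1), x = lam * (-log(1 - u)) ** (1 / k) follows
    Weibull(shape=k, scale=lam) and is differentiable in k and lam,
    which is what makes the distribution reparameterizable."""
    u = rng.random()
    return lam * (-math.log(1.0 - u)) ** (1.0 / k)

# Per-key positive draws give unnormalized weights; normalizing across
# keys yields an attention vector on the probability simplex.
draws = [weibull_rsample(2.0, 1.0) for _ in range(4)]
attn = [d / sum(draws) for d in draws]
```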